-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds ADR for remapping PublicationInformation #132
Conversation
} | ||
``` | ||
|
||
Notes: we should consider whether this really needs to be an array. 260 and 264 are repeatable fields in MARC, but whether they are regularly used in that way or if we could instead pick the "first" one that shows up and get a simpler model without losing any real world functionality would be worth investigating. For now it is modeled as an array of `PublicationInfo` assuming that is appropriate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From EngX's / API POV, what makes a single value simpler than an array? Is it parsing those results from the actual document in GraphQL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's less EngX than end users I'm thinking about. Nested objects can be more difficult to work with in GraphQL than the equivalent top level field. And strings are easier than arrays to work with initially. This doesn't mean we should aim for totally flat records of entirely strings, just that when our data does not require objects or arrays, we should lean towards strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we should make it an array given that it's possible for multiple occurrences, but I agree that it would be useful to find out how often that happens, if ever, to inform that decision. It would be kind of silly if we accommodated repeatable MARC fields that are never actually repeated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to ask about what sorts of warnings or logs are generated when Transmog encounters multiple values when it expects only one - but it seems that we've already decided not to try and impose a single-value requirement at this time?
At this time, I'm leaning towards Option 2 for a couple of reasons. First, I found this quite compelling:
Second, when you extend that thinking further, though I'm confident we will encounter publisher dates, it's more likely that we'd want those dates when already in the context (i.e. facet filter) of thinking about dates. This, as opposed to rooting around in an object for them. This might suggest a somewhat general pattern where some fields in TIMDEX benefit from having |
also adds config to allow adr cli to create an adr in the same directory other adrs are found in this repository https://mitlibraries.atlassian.net/browse/GDT-199
42fe8c4
to
aeb4afc
Compare
| GIS MIT | | PublicationInfo.Name | | | ||
| GIS OGM | | PublicationInfo.Name | | | ||
|
||
### Proposed mappings (Option 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JPrevost For Option 2, would it also require only choosing the first occurrence of the MARC 260 and 264 fields? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! I believe similar to the discussion in Option 1, the fields are repeatable but it's unclear how frequently they are actually repeated in practice. I'd suggest we may want to look at our data to decide if this should be an Array of Strings or a single String. I suspect an Array of Strings will be where we end up but I'd hate to make that decision based on suspicion rather than data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to tease this out a bit more... if:
publisher
is a multivalued string field- and any other publisher information gets written to other fields like
dates
orlocations
(see comment below)
wouldn't this mean, for Option 2, that we could accomodate both MARC 260 and 264? Stated more generally, that Option 2 supports multiple publishers by virtue of the new publisher
field being multivalued and all other values mapped to other, multivalued fields as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. If we just make publisher
an array of Strings, we can support any source that currently or in the future has multiple values for all aspects of this (dates, names, locations).
If we make publisher
a single string, we could still support multiple Dates/Locations if that is the only thing that is truly multivalued in our sources. That may be a bit hard to tease out if that is all we truly need or if we should just accept that supporting multivalued is better because we just aren't sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to @jonavellecuerdo for initiating a look into this data, here is a spreadsheet that gives some insight into how publication_information
is currently used across sources: https://docs.google.com/spreadsheets/d/1zT4LlGQDyuwvxZDMnHGtgu45bnl2Kau42n_Njg2T4qQ/edit#gid=0.
It shows a maximum of ten rows per combination of source
+ publication_information
array length to get a sense of what individual values look like; ranging from 2-11 values in this array.
Also, here also is a breakdown of source
x array_length
for this field:
+-----------------------------------------------+------------+------------------+
|source |array_length|array_length_count|
+-----------------------------------------------+------------+------------------+
|MIT Alma |11 |1 |
|MIT Alma |10 |1 |
|MIT Alma |9 |2 |
|MIT Alma |8 |2 |
|MIT Alma |7 |6 |
|MIT Alma |6 |24 |
|Woods Hole Open Access Server |5 |1 |
|OpenGeoMetadata GIS Resources |5 |1 |
|MIT Alma |5 |80 |
|MIT Alma |4 |526 |
|OpenGeoMetadata GIS Resources |4 |25 |
|MIT GIS Resources |3 |1 |
|MIT Alma |3 |19345 |
|Woods Hole Open Access Server |3 |40 |
|OpenGeoMetadata GIS Resources |3 |426 |
|DSpace@MIT |2 |22 |
|Woods Hole Open Access Server |2 |1063 |
|MIT Alma |2 |317647 |
|OpenGeoMetadata GIS Resources |2 |89724 |
|MIT GIS Resources |2 |1986 |
|MIT Alma |1 |2721358 |
|MIT ArchivesSpace |1 |1288 |
|OpenGeoMetadata GIS Resources |1 |29687 |
|DSpace@MIT |1 |127517 |
|Woods Hole Open Access Server |1 |5135 |
|Zenodo |1 |4167 |
|LibGuides |1 |362 |
|MIT GIS Resources |1 |56 |
|Abdul Latif Jameel Poverty Action Lab Dataverse|1 |131 |
|Research Databases |null |0 |
|MIT Alma |null |0 |
|Woods Hole Open Access Server |null |0 |
|DSpace@MIT |null |0 |
+-----------------------------------------------+------------+------------------+
A couple of takeaways:
- the majority of records have 1 value (single valued)
- about 300k in Alma have 2 values
- BUT, many of the 2nd values are just copyright dates like
@2014
- BUT, many of the 2nd values are just copyright dates like
- Alma records with 3 values are more indicative of another actual publisher
- tail for 3+ sources are somewhat outliers
From this cursory look, it appears that supporting multiple publishers would be beneficial. Furthermore, there are lots of instances where the 2nd "publisher" in these arrays is just a date, which seems to lean into Option #2 where dates could be broken out separately, thereby avoiding a Publisher
object with nothing but a date while still getting that date included in the record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for the peek into the data. This is extremely helpful context. I feel like I have enough info to add some additional context to the Options and propose a Decision (option 2 seems popular with the top level array for publisher names and everything else broken into Location and Dates).
|
||
Move some of the Alma data to other existing objects, and create a new top level `Publisher` array of strings. | ||
|
||
All sources except Alma remap to use `Publisher` instead of `PublicationInformation` with no other changes at this time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious about this line:
All sources except Alma remap to use
Publisher
instead ofPublicationInformation
with no other changes at this time.
I realized on my first read, I had made the assumption:
- all sources (including Alma) would begin to write the publisher name to this new multivalued string
publisher
field - future work: any sources (including Alma) that have publisher date or location information, could write to other appropriate fields, e.g.
dates
orlocations
with a qualifier like@kind="Publisher"
Does this align with your thinking and the wording here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can reword that better for sure. Your wording is what my wording intended to say :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JPrevost Does this change mean TimdexRecord.publication_information
/PublicationInformation
will be deprecated? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we'd deprecate at the GraphQL layer PublicationInformation
in both Option 1 and Option 2. We'd map data to it from different places in the two options until we were confident it was no longer being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JPrevost - language updated in Option 2, per your agreement with a slight rewording, per this commit: 1dfcfab.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with everyone else so far on Option 2, the decoupling from MARC will make the MARC transform logic more complicated but I think it's better in the long run
FWIW, this might suggest that Option 1 (array of complex publisher objects) aligns more closely with MODS: https://www.loc.gov/standards/mods/userguide/origininfo.html. In the MODS schema, the To me, the difference between Option 1 and 2 feels like it gets at a bigger data model decision: do we store subjects/dates/locations under complex objects like publishers, contributors, etc.? or do we store those as simple arrays of strings, and then qualify any subects/dates/locations with a The decision for this ADR, feels like it would set precedence for that going forward, as I'm not seeing this situation in any other fields (complex objects with child dates/locations). I think that I continue to support Option 2, and lean into that model:
To me, this is more flexibile in the long run. As we encounter subjects / dates / locations that should be in the record, but don't fall under a complex object, we can continue to slot them into those fields, but ensure (and document) what the |
Option 2 sounds good to me, especially after looking into |
In case any further coalescence is needed, I support Option 2 at this point, if not just for flexibility and ease of use. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For those cases where we have multiple publishers (not just multiple publication years), would Option 2 introduce potential confusion about the relationship between years, locations, and names among the publishers? Am I interpreting these options correctly with these arrangements?
Option 1
{
"publicationInfo": [
{
"name": "Massachusetts Institute of Technology",
"location": "Cambridge, MA",
"date": "2014"
},
{
"name": "Wiley",
"location": "Hoboken, NJ",
"date": "2016"
}
]
}
Option 2
{
"date": [
{
"kind": "Publication Date",
"value": "2016"
},
{
"kind": "Publication Date",
"value": "2014"
}
],
"location": [
{
"kind": "Publisher Location",
"value": "Cambridge, MA"
},
{
"kind": "Publisher Location",
"value": "Hoboken, NJ"
}
],
"publisher": [
"Massachusetts Institute of Technology",
"Wiley"
]
}
I'm not sure how much of an issue this is - it might be fine.
} | ||
``` | ||
|
||
Notes: we should consider whether this really needs to be an array. 260 and 264 are repeatable fields in MARC, but whether they are regularly used in that way or if we could instead pick the "first" one that shows up and get a simpler model without losing any real world functionality would be worth investigating. For now it is modeled as an array of `PublicationInfo` assuming that is appropriate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to ask about what sorts of warnings or logs are generated when Transmog encounters multiple values when it expects only one - but it seems that we've already decided not to try and impose a single-value requirement at this time?
@matt-bernhardt great question. Yes, Option 1 provides an ability Option 2 does not that we hadn't stated (and I hadn't noticed). Namely, if there is value in keeping the publisher related date together when there are multiple publishers, Option 1 provides that in a way that Option 2 does not. Based on how infrequently there appear to be multiple real publishers (versus currently a lot of the extra publishers are actually just years inserted into the array), this may not be a concern. However, it should be noted as a distinct feature difference between the two options for sure and noted in the consequences section if we choose the one that would lose this ability to keep these fields together in an object. |
Agreed, that @matt-bernhardt's observation is an important one, namely that we decouple the publisher name from the its dates/locations and lose that tether. Maybe this is not problematic if we continue to view these TIMDEX records through a discovery lens, and not a canonical record lens. And, potentially even more accurate when we consider the behavior of facets. Consider this situation with two publishers:
If this record comes up in a search, and a user clicks the publisher facet for The flipside -- and what I assume is the concern raised above -- is that by decoupling them, we would lose the ability to look directly at this TIMDEX record and know:
I hesitate to suggest it... but what if we combined Option 1 & 2? {
"publishers": [
{
"name": "Great Writings",
"date": "1930", <----- use original string
"location": "Bend, OR"
},
{
"name": "Amazon Reprints",
"date": "2020", <----- use original string
"location": "Seattle, WA"
},
{
"name": "Ebooks Inc.",
"date": "Believed to be 2023 but records are hazy" <----- use original string
}
],
"date": [
{
"kind": "Published",
"value": "1980-01-01" <----- parsed as true dates, with an eye towards this desired state
},
{
"kind": "Published",
"value": "2020-01-01" <----- parsed as true dates, with an eye towards this desired state
}
],
"locations": [
{
"kind": "Published",
"value": "Bend, OR"
},
{
"kind": "Published",
"value": "Seattle, WA"
}
]
} In this scenario, we'd get:
There is duplication of information, but we'ev also normalized and filtered out bad values while still retaining the original which may have some meaning. It also might be less jarring to see this in the context of other fields. If the pattern is: store complex objects, but attempt to extract meaningful dates / subjects / locations from it and store in those respective fields, then maybe it doesn't seem as unusual? Lastly, from a Transmogrifier POV, we could focus on writing complex Publisher objects, then have shared methods and normalization logic across all sources that attempt to parse out distinct subjects, dates, locations, etc. Just food for thought! |
I still think discovery is the most appropriate lens to view TIMDEX and combining 1 & 2 feels like it's introducing a lot of complexity for accuracy in a likely rare occurrence of multiple legitimate publishers |
More comments forthcoming, but I did want to correct a mistake in my comment from last week. Specifically, that when selecting a facet value, we don't "lose" the other facet values from that particular record. See this example of selecting "Polygon data" as a facet value, but because the document contains other values for "Content Type", they still show up: ![]() I think this is an important clarification, as neither Option 1 or 2 result in the behavior I had suggested above. |
@JPrevost, @matt-bernhardt, @ehanson8, @jazairi, @jonavellecuerdo - an "Option 3" has been added per this commit that sketches a combination of Option 1 and Option 2. Matt's observation about Option 2 losing context between publisher names and dates/locations felt like an important one. Personally, this felt like a deal breaker for Option 2 for me. However, I feel Option 1 may continue to propagate some pain points we are feeling in the TimdexRecord / Opensearch / GraphQL / Search UI / Item Page UI relationships where we are required to dig into complex objects for aggregations. This alone may not be technically problematic, but it could be brittle in the longer term (see example in Option 3 where "Publisher" or "Date" sections start to drift between the search UI filters and the item page representation). It's hard to succinctly summarize, and might be beyond the scope of this ADR to some degree, but to me Option 3 represents a pattern that could be applied going forward: store complex objects to preserve context, but also extract subjects / dates / locations for ease of use and consistency of aggregation and item viewing. Perhaps Option 3 is not ultimately what we want, but hoping it can help clarify the strenghts and weaknesses of the current Options 1 and 2. At this time, my vote would be for Option 3. |
From this mini data dive, looks like we might have a fairly sizable number of legitimate publishers for Alma? I'd also propose that storing a complex object + extracting dates and locations really is the same complexity as option #2, we're just storing a bit more than the publisher name in the final record. The actual parsing and movement through the data should be the same. Furthermore, as noted in Option 3, I think it's possible we could have each source just focus on creation of the object, and have a global process extract dates and locations from the publisher objects, for all sources. But could discuss that implementation detail separately. |
I like Option 3. I think lines 165-168 of the ADR summarize it really well. I feel it benefits users by preventing information loss, and it also benefits developers by easing navigation of the data model. As a side note, |
Agreed @jonavellecuerdo about looking for opportunities to standardize the It may be out of scope for implementing the final decision of this particular ADR, but a follow-up ADR at some point around normalizing |
I'm fine with Option 3 as well. Standardizing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option 3 sounds good to me, too. I agree with @ghukill that, from the perspective of the API consumer, this structure is no more complex to navigate than the one presented in Option 2. That it also provides richer and clearer data makes this an easy choice for me.
Thanks, all, for the nuanced and thoughtful discussion on this, leading to another excellent decision. 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved but one requested name change
|
||
Proceed with Option 3: | ||
|
||
- create new, top level, multivalued object field `PublicationInfo` with properties `[name, date, location]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be PublicationInformation
? We aren't abbreviating any of the other transmogrifier
field names or classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little fuzzy on GraphQL deprecation, but perhaps that is contributing to this variation on PublicationInformation
which already existed?
Is it correct to assume @JPrevost, @jazairi, @matt-bernhardt, that if we changed the field type of the pre-existing PublicationInformation
field, that doesn't give us the ability to really deprecate it? Because from GraphQL's POV the field name didn't change, so no ability to say, "This field is deprecated, use XYZ..."?
If so, @ehanson8 and @jazairi (as thumbs-uppers for this suggestion), what would you think about Publishers
as the complex object field name? My assumption is that "Information" or "Info" was originally proposed in the event we don't have a name, where just a date or location feels a little odd as a "Publisher". But, that's generally only the case for copyright dates, and we could elect to skip a Publisher
entry if only that copyright date, and instead just write that directly to Dates
.
All considered, I suppose my vote would be Publishers
as the top level field for complex objects, thereby matching other field name patterns like Contributors
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with whatever so long as it's not abbreviated 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes, good point about the name conflict. I'd be happy with Publishers
as an alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I originally supported Option 2, the comment from @matt-bernhardt about losing the context of dates and locations with respect to the publisher name was a deal breaker for me.
After much discussion and a new option 3 added, I support this decision as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻 (I can't approve as I opened the PR, but I support the decision to go with Option 3)
Why these changes are being introduced: We recently [made the decision in transmogrifier](MITLibraries/transmogrifier#132) To map publication information to a new nested field, named `publishers` to avoid a naming conflict with the existing `publicationInformation` field. Relevant ticket(s): * [GDT-270](https://mitlibraries.atlassian.net/browse/GDT-270) * [GDT-218](https://mitlibraries.atlassian.net/browse/GDT-218) * [GDT-229](https://mitlibraries.atlassian.net/browse/GDT-229) * [GDT-271](https://mitlibraries.atlassian.net/browse/GDT-271) How this addresses that need: This adds the new `publishers` field and deprecates `publicationInformation`. Side effects of this change: A follow-on ticket is needed to add this field to the UI. (This is ticketed as GDT-271.)
Why these changes are being introduced: We recently [made the decision in transmogrifier](MITLibraries/transmogrifier#132) To map publication information to a new nested field, named `publishers` to avoid a naming conflict with the existing `publicationInformation` field. Relevant ticket(s): * [GDT-270](https://mitlibraries.atlassian.net/browse/GDT-270) * [GDT-218](https://mitlibraries.atlassian.net/browse/GDT-218) * [GDT-229](https://mitlibraries.atlassian.net/browse/GDT-229) * [GDT-271](https://mitlibraries.atlassian.net/browse/GDT-271) How this addresses that need: This adds the new `publishers` field and deprecates `publicationInformation`. Side effects of this change: A follow-on ticket is needed to add this field to the UI. (This is ticketed as GDT-271.)
also adds config to allow adr cli to create an adr in the same directory other adrs are found in this repository
https://mitlibraries.atlassian.net/browse/GDT-199
Rendered markdown of current ADR: https://github.com/MITLibraries/transmogrifier/blob/gdt-199-publication-adr/docs/adrs/0003-support-aggregations-on-publisher-name.md
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES